According to Centers for Disease Control and Prevention, in 2015-2016, more than 12% of adults at the age of 20 and above in the United States have cholesterol levels higher than 240mg/ml. A high cholesterol level is one of the major risk factors for coronary heart disease, heart attack and stroke. While there are two types of cholesterol, low-density lipoprotein (LDL) is known as the “bad” one that leads to the buildup of cholesterol in arteries. The goal of this project is to explore factors that can affect the level of LDL in a human body, and the result is intended to help readers gain insights on how to balance the level of LDL in order to control/prevent cardiovascular diseases.
The analysis result shows that …
The data analysis is performed with the NHANES 2015-2016 data. The dependent variable, the level of low-density lipoprotein (LDL), measured by mg/dl, is selected from the dataset Cholesterol - LDL & Triglycerides of the laboratory data. Since a high level of triglyceride is believed to be associated with a high level of LDL, it is included in the model as the independent variable. Triglyceride data is also obtained from Cholesterol - LDL & Triglycerides of the laboratory data. We also include blood pressure readings from Blood Pressure dataset of the examination data. According to the data description, some participants have multiple blood pressure readings. For simplicity, we use averaged systolic and diastolic blood pressure readings as blood pressure measurements for each individual. Averaged intakes of fat and cholesterol, computed using both the First and the Second Day Total Nutrient Intakes of the dietary data, are added as independent variables as well. To account for more individual differences, we also include gender, race and age from the Demographics Data, and height, weight and BMI information from Body Measures of the examination data as additional covariates. Furthermore, SEQN, the respondent sequence number, is utilized as the unique identifier to match responses for each respondent. Finally, we removed all rows containing missing values, and there are a total of 2503 observations available for further analysis.
We fit models using multiple linear regression techniques and then perform model selections to choose the model that best describes the level of LDL. For the very first model, we regress the dependent variable LDL on all predictors:
\(\mathbf{LDL}\) ~ \(\mathbf{age + race + gender + height + weight + BMI + fat + cholesterol + triglyceride + diastolic + systolic}\) (1)
Note that covariates gender and race are treated as categorical variables.
A check of the relationship between residuals and fitted values suggests a transformation for the dependent variable. With the help of the Box-Cox test (attach the code? plot), we identify that the square root transformation is the best choice. We then fit a new linear model with the transformed dependent variable, \(\sqrt{LDL}\):
\(\mathbf{\sqrt{LDL}}\) ~ \(\mathbf{age + race + gender + height + weight + BMI + fat + cholesterol + triglyceride + diastolic + systolic}\) (2)
As the model has as many as 11 covariates and some of them are insignificant under t-test, we then use the stepwise selection technique to choose variables that best explain \(\sqrt{LDL}\).
Besides, we consider transformations upon predictors. Considering partial residual plots (attach plots), we find out that variables age and triglycerides violate the linear structure assumption. Both of these plots exhibit a quadratic form, so in addition to response variables in the full model (2) above, we add age2 and triglycerides2 to the linear regression model:
\(\mathbf{\sqrt{LDL}}\) ~ \(\mathbf{age + age^2 + race + gender + height + weight + BMI + fat + cholesterol + triglyceride + triglyceride^2 + diastolic + systolic}\) (3)
Just as the process before, we will execute the stepwise model selection technique to choose the most significant variables of this model.
Steps outlined above are carried out in R, Stata and Python. In R, we use the package data.table for data cleaning.
## Group Project HTML
## Author: Huayu Li, huayuli@umich.edu
## Updated: Dec. 8 2019
#### Data cleaning using data.table
## Libraries: -------------------------------------------------------------------------
library(data.table)
library(foreign)
library(tidyverse)
## 80: --------------------------------------------------------------------------------
## Read the datasets
demo=data.table(read.xport("./Original Data/DEMO_I.XPT.txt"))
tot1=data.table(read.xport("./Original Data/DR1TOT_I.XPT.txt"))
tot2=data.table(read.xport("./Original Data/DR2TOT_I.XPT.txt"))
b_pres=data.table(read.xport("./Original Data/BPX_I.XPT.txt"))
ldl=data.table(read.xport("./Original Data/TRIGLY_I.XPT.txt"))
measure=data.table(read.xport("./Original Data/BMX_I.XPT.txt"))
## For each dataset, choose the proper variables and make
## some transformation.
# For demo dataset, we choose seqn, gender, age and race variables
Demo=demo[,.(seqn=SEQN,gender=as.factor(RIAGENDR),
age=RIDAGEYR,race=as.factor(RIDRETH3))]
# For dietary data, we choose seqn, intake fat, intake cholesterol
# for each day.
TOT1=tot1[,.(seqn=SEQN,intake_fat1=DR1TTFAT,
intake_chol1=DR1TCHOL)]
TOT2=tot2[,.(seqn=SEQN,intake_fat2=DR2TTFAT,
intake_chol2=DR2TCHOL)]
## Next we will use the average intake of the two days into
## our model. The average step is as following:
intake_type=c('intake_fat1','intake_fat2','intake_chol1','intake_chol2')
TOT=TOT1%>%merge(.,TOT2,by='seqn',all=FALSE)
TOT=melt(TOT,measure=intake_type)[
,.(seqn,type=factor(variable,intake_type,c(rep('intake_fat',times=2),
rep('intake_chol',times=2))),
variable,value)
][
,.(intake=mean(value,na.rm=TRUE)),by=.(seqn,type)
]
TOT=dcast(TOT,...~type,value.var=c('intake'))
# For blood pressure, we choose seqn, systolic pressures and diastolic
# pressures. We then use the average pressure as the final pressure.
pres_type=c(paste('sys',1:4,sep=''),paste('dia',1:4,sep=''))
B_pres=b_pres[,.(seqn=SEQN,sys1=BPXSY1,sys2=BPXSY2,sys3=BPXSY3,sys4=BPXSY4,
dia1=BPXDI1,dia2=BPXDI2,dia3=BPXDI3,dia4=BPXDI4)]
B_pres=melt(B_pres,measure=pres_type)[,
.(seqn,type=factor(variable,pres_type,
c(rep('s',times=4),rep('d',times=4))),
variable,pressure=value)
][
,.(pres=mean(pressure,na.rm=TRUE)),by=.(seqn,type)
]
B_pres=dcast(B_pres,...~type,value.var=c('pres'))[
,.(seqn,systolic=s,diastolic=d)
]
# For ldl dataset, we choose seqn, LDL-cholesterol and Triglyceride
# for mg/dL.
LDL=ldl[,.(seqn=SEQN,ldl=LBDLDL,triglycerides=LBXTR)]
# For body measure dataset, we choose weight height and bmi as our
# variables.
Measure=measure[,.(seqn=SEQN,weight=BMXWT,height=BMXHT,bmi=BMXBMI)]
## Now merge the datasets into one whole, with the seqn as
## the merging label. By the way, some seqn labels
## should be removed, for they are not included in LDL dataset.
Data=Demo%>%merge(.,TOT,by='seqn',all=FALSE)%>%
merge(.,B_pres,by='seqn',all=FALSE)%>%
merge(.,LDL,by='seqn',all=FALSE)%>%
merge(.,Measure,by='seqn',all=FALSE)%>%
na.omit()
## Now save the cleaned data
fwrite(Data,file='Final.csv')
### Using this file for regression: ---------------------------------------------------
## Libraries: -------------------------------------------------------------------------
library(lme4)
library(MASS)
library(car)
## 80: --------------------------------------------------------------------------------
## Remove the seqn variable, and set gender and race as factor variables
DT=Data[,.(gender=as.factor(gender),age,race=as.factor(race),intake_fat,intake_chol,
systolic,diastolic,ldl,triglycerides,weight,height,bmi)]
## First of all, we will fit the model with all variables, and then give
## the residual plot of the model.
L1=lm(ldl~.,data=DT)
summary(L1)
##
## Call:
## lm(formula = ldl ~ ., data = DT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -113.646 -22.757 -2.451 19.297 149.850
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101.571467 44.874374 2.263 0.0237 *
## gender2 1.171152 1.847729 0.634 0.5262
## age 0.175782 0.038684 4.544 5.78e-06 ***
## race2 3.952555 2.414177 1.637 0.1017
## race3 2.036991 2.093998 0.973 0.3308
## race4 3.506045 2.285949 1.534 0.1252
## race6 5.104577 2.738375 1.864 0.0624 .
## race7 6.036927 3.880265 1.556 0.1199
## intake_fat 0.005288 0.022726 0.233 0.8160
## intake_chol -0.001265 0.004397 -0.288 0.7736
## systolic -0.005915 0.045972 -0.129 0.8976
## diastolic 0.401837 0.057452 6.994 3.41e-12 ***
## triglycerides 0.142759 0.011379 12.546 < 2e-16 ***
## weight 0.301651 0.266228 1.133 0.2573
## height -0.318111 0.269684 -1.180 0.2383
## bmi -0.595490 0.741391 -0.803 0.4219
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.34 on 2487 degrees of freedom
## Multiple R-squared: 0.1342, Adjusted R-squared: 0.129
## F-statistic: 25.69 on 15 and 2487 DF, p-value: < 2.2e-16
plot(L1$fitted.values,L1$residuals)
## Here, it seems that some transformations should be used upon ldl. Here
## we do the Box-Cox test.
boxcox(L1,plotit=TRUE,lambda=seq(0,1,1/100))
## Here it seems that lambda=0.5 is the best choice, that is, to use sqrt(ldl).
## Here we make the transformation and then do the regression again.
L2=lm(sqrt(ldl)~.,data=DT)
summary(L2)
##
## Call:
## lm(formula = sqrt(ldl) ~ ., data = DT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1905 -1.0601 0.0021 1.0178 5.5148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.005e+01 2.158e+00 4.658 3.36e-06 ***
## gender2 5.965e-02 8.886e-02 0.671 0.5021
## age 8.429e-03 1.860e-03 4.531 6.15e-06 ***
## race2 1.830e-01 1.161e-01 1.576 0.1151
## race3 8.270e-02 1.007e-01 0.821 0.4116
## race4 1.305e-01 1.099e-01 1.187 0.2354
## race6 2.302e-01 1.317e-01 1.748 0.0806 .
## race7 2.744e-01 1.866e-01 1.470 0.1416
## intake_fat 4.259e-04 1.093e-03 0.390 0.6968
## intake_chol -8.613e-05 2.115e-04 -0.407 0.6838
## systolic -3.532e-04 2.211e-03 -0.160 0.8731
## diastolic 2.006e-02 2.763e-03 7.261 5.11e-13 ***
## triglycerides 6.538e-03 5.472e-04 11.948 < 2e-16 ***
## weight 1.492e-02 1.280e-02 1.165 0.2440
## height -1.644e-02 1.297e-02 -1.268 0.2050
## bmi -2.700e-02 3.566e-02 -0.757 0.4489
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.604 on 2487 degrees of freedom
## Multiple R-squared: 0.1327, Adjusted R-squared: 0.1275
## F-statistic: 25.38 on 15 and 2487 DF, p-value: < 2.2e-16
## There are too many variables in the regression model, so here we will do
## the model selection and choose the variables. Here we do both the forward
## and backward selections.
L3_a=step(L2,direction='forward')
## Start: AIC=2380.04
## sqrt(ldl) ~ gender + age + race + intake_fat + intake_chol +
## systolic + diastolic + triglycerides + weight + height +
## bmi
L3_b=step(L2,direction='backward')
## Start: AIC=2380.04
## sqrt(ldl) ~ gender + age + race + intake_fat + intake_chol +
## systolic + diastolic + triglycerides + weight + height +
## bmi
##
## Df Sum of Sq RSS AIC
## - race 5 13.64 6409.0 2375.4
## - systolic 1 0.07 6395.5 2378.1
## - intake_fat 1 0.39 6395.8 2378.2
## - intake_chol 1 0.43 6395.8 2378.2
## - gender 1 1.16 6396.6 2378.5
## - bmi 1 1.47 6396.9 2378.6
## - weight 1 3.49 6398.9 2379.4
## - height 1 4.13 6399.5 2379.7
## <none> 6395.4 2380.0
## - age 1 52.79 6448.2 2398.6
## - diastolic 1 135.59 6531.0 2430.6
## - triglycerides 1 367.08 6762.5 2517.7
##
## Step: AIC=2375.37
## sqrt(ldl) ~ gender + age + intake_fat + intake_chol + systolic +
## diastolic + triglycerides + weight + height + bmi
##
## Df Sum of Sq RSS AIC
## - systolic 1 0.09 6409.1 2373.4
## - intake_fat 1 0.14 6409.2 2373.4
## - intake_chol 1 0.40 6409.4 2373.5
## - gender 1 1.43 6410.5 2373.9
## - bmi 1 1.95 6411.0 2374.1
## - weight 1 3.95 6413.0 2374.9
## - height 1 4.35 6413.4 2375.1
## <none> 6409.0 2375.4
## - age 1 54.06 6463.1 2394.4
## - diastolic 1 144.08 6553.1 2429.0
## - triglycerides 1 384.56 6793.6 2519.2
##
## Step: AIC=2373.41
## sqrt(ldl) ~ gender + age + intake_fat + intake_chol + diastolic +
## triglycerides + weight + height + bmi
##
## Df Sum of Sq RSS AIC
## - intake_fat 1 0.15 6409.3 2371.5
## - intake_chol 1 0.39 6409.5 2371.6
## - gender 1 1.54 6410.7 2372.0
## - bmi 1 1.96 6411.1 2372.2
## - weight 1 3.95 6413.1 2372.9
## - height 1 4.33 6413.5 2373.1
## <none> 6409.1 2373.4
## - age 1 65.74 6474.9 2396.9
## - diastolic 1 156.87 6566.0 2431.9
## - triglycerides 1 384.54 6793.7 2517.2
##
## Step: AIC=2371.47
## sqrt(ldl) ~ gender + age + intake_chol + diastolic + triglycerides +
## weight + height + bmi
##
## Df Sum of Sq RSS AIC
## - intake_chol 1 0.24 6409.5 2369.6
## - gender 1 1.52 6410.8 2370.1
## - bmi 1 1.94 6411.2 2370.2
## - weight 1 3.92 6413.2 2371.0
## - height 1 4.24 6413.5 2371.1
## <none> 6409.3 2371.5
## - age 1 65.61 6474.9 2395.0
## - diastolic 1 157.88 6567.2 2430.4
## - triglycerides 1 384.84 6794.1 2515.4
##
## Step: AIC=2369.56
## sqrt(ldl) ~ gender + age + diastolic + triglycerides + weight +
## height + bmi
##
## Df Sum of Sq RSS AIC
## - gender 1 1.69 6411.2 2368.2
## - bmi 1 1.96 6411.5 2368.3
## - weight 1 3.93 6413.5 2369.1
## - height 1 4.28 6413.8 2369.2
## <none> 6409.5 2369.6
## - age 1 65.48 6475.0 2393.0
## - diastolic 1 157.90 6567.4 2428.5
## - triglycerides 1 384.88 6794.4 2513.5
##
## Step: AIC=2368.22
## sqrt(ldl) ~ age + diastolic + triglycerides + weight + height +
## bmi
##
## Df Sum of Sq RSS AIC
## - bmi 1 1.72 6412.9 2366.9
## - weight 1 3.67 6414.9 2367.7
## <none> 6411.2 2368.2
## - height 1 5.36 6416.6 2368.3
## - age 1 65.04 6476.2 2391.5
## - diastolic 1 159.26 6570.5 2427.6
## - triglycerides 1 383.55 6794.8 2511.7
##
## Step: AIC=2366.89
## sqrt(ldl) ~ age + diastolic + triglycerides + weight + height
##
## Df Sum of Sq RSS AIC
## <none> 6412.9 2366.9
## - height 1 13.59 6426.5 2370.2
## - weight 1 21.39 6434.3 2373.2
## - age 1 63.64 6476.6 2389.6
## - diastolic 1 158.13 6571.1 2425.9
## - triglycerides 1 388.91 6801.8 2512.3
summary(L3_a)
##
## Call:
## lm(formula = sqrt(ldl) ~ gender + age + race + intake_fat + intake_chol +
## systolic + diastolic + triglycerides + weight + height +
## bmi, data = DT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1905 -1.0601 0.0021 1.0178 5.5148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.005e+01 2.158e+00 4.658 3.36e-06 ***
## gender2 5.965e-02 8.886e-02 0.671 0.5021
## age 8.429e-03 1.860e-03 4.531 6.15e-06 ***
## race2 1.830e-01 1.161e-01 1.576 0.1151
## race3 8.270e-02 1.007e-01 0.821 0.4116
## race4 1.305e-01 1.099e-01 1.187 0.2354
## race6 2.302e-01 1.317e-01 1.748 0.0806 .
## race7 2.744e-01 1.866e-01 1.470 0.1416
## intake_fat 4.259e-04 1.093e-03 0.390 0.6968
## intake_chol -8.613e-05 2.115e-04 -0.407 0.6838
## systolic -3.532e-04 2.211e-03 -0.160 0.8731
## diastolic 2.006e-02 2.763e-03 7.261 5.11e-13 ***
## triglycerides 6.538e-03 5.472e-04 11.948 < 2e-16 ***
## weight 1.492e-02 1.280e-02 1.165 0.2440
## height -1.644e-02 1.297e-02 -1.268 0.2050
## bmi -2.700e-02 3.566e-02 -0.757 0.4489
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.604 on 2487 degrees of freedom
## Multiple R-squared: 0.1327, Adjusted R-squared: 0.1275
## F-statistic: 25.38 on 15 and 2487 DF, p-value: < 2.2e-16
summary(L3_b)
##
## Call:
## lm(formula = sqrt(ldl) ~ age + diastolic + triglycerides + weight +
## height, data = DT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2425 -1.0646 0.0056 1.0406 5.5286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.8488834 0.5771649 15.332 < 2e-16 ***
## age 0.0079968 0.0016065 4.978 6.87e-07 ***
## diastolic 0.0203065 0.0025879 7.847 6.28e-15 ***
## triglycerides 0.0064965 0.0005279 12.306 < 2e-16 ***
## weight 0.0049061 0.0017000 2.886 0.00394 **
## height -0.0083689 0.0036375 -2.301 0.02149 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.603 on 2497 degrees of freedom
## Multiple R-squared: 0.1304, Adjusted R-squared: 0.1286
## F-statistic: 74.87 on 5 and 2497 DF, p-value: < 2.2e-16
## By the way, in the models before, we didn't consider transformations
## upon predictors; in the coming part, we will consider adding some
## nonlinear terms.
crPlots(L2)
## From the partial residual plots, we can find out that for triglycerides and age,
## some nonlinear transformation forms should be add. We add this term, and the
## regression result is as following:
L4=lm(sqrt(ldl)~gender+age+race+intake_fat+intake_chol+systolic+diastolic+
weight+height+bmi+triglycerides+I(triglycerides^2)+I(age^2),data=DT)
summary(L4)
##
## Call:
## lm(formula = sqrt(ldl) ~ gender + age + race + intake_fat + intake_chol +
## systolic + diastolic + weight + height + bmi + triglycerides +
## I(triglycerides^2) + I(age^2), data = DT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2795 -0.9630 -0.0056 0.9958 5.4639
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.155e+01 2.065e+00 5.593 2.47e-08 ***
## gender2 -7.780e-02 8.572e-02 -0.908 0.36421
## age 1.075e-01 8.753e-03 12.282 < 2e-16 ***
## race2 8.510e-02 1.109e-01 0.767 0.44310
## race3 1.954e-01 9.673e-02 2.020 0.04349 *
## race4 2.324e-01 1.053e-01 2.207 0.02739 *
## race6 1.303e-01 1.258e-01 1.035 0.30054
## race7 3.578e-01 1.782e-01 2.008 0.04472 *
## intake_fat -7.455e-05 1.043e-03 -0.071 0.94303
## intake_chol -1.705e-04 2.019e-04 -0.844 0.39857
## systolic 3.840e-03 2.141e-03 1.794 0.07298 .
## diastolic 6.303e-03 2.858e-03 2.206 0.02750 *
## weight 2.520e-02 1.223e-02 2.060 0.03950 *
## height -3.457e-02 1.246e-02 -2.775 0.00556 **
## bmi -7.122e-02 3.413e-02 -2.087 0.03703 *
## triglycerides 2.150e-02 1.660e-03 12.953 < 2e-16 ***
## I(triglycerides^2) -5.019e-05 5.005e-06 -10.027 < 2e-16 ***
## I(age^2) -1.138e-03 9.593e-05 -11.862 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.53 on 2485 degrees of freedom
## Multiple R-squared: 0.2114, Adjusted R-squared: 0.206
## F-statistic: 39.2 on 17 and 2485 DF, p-value: < 2.2e-16
## Just the same, do the model selection.
L5_a=step(L4,direction='forward')
## Start: AIC=2145.93
## sqrt(ldl) ~ gender + age + race + intake_fat + intake_chol +
## systolic + diastolic + weight + height + bmi + triglycerides +
## I(triglycerides^2) + I(age^2)
L5_b=step(L4,direction='backward')
## Start: AIC=2145.93
## sqrt(ldl) ~ gender + age + race + intake_fat + intake_chol +
## systolic + diastolic + weight + height + bmi + triglycerides +
## I(triglycerides^2) + I(age^2)
##
## Df Sum of Sq RSS AIC
## - race 5 17.66 5832.7 2143.5
## - intake_fat 1 0.01 5815.1 2143.9
## - intake_chol 1 1.67 5816.7 2144.7
## - gender 1 1.93 5817.0 2144.8
## <none> 5815.0 2145.9
## - systolic 1 7.53 5822.6 2147.2
## - weight 1 9.93 5825.0 2148.2
## - bmi 1 10.19 5825.2 2148.3
## - diastolic 1 11.38 5826.4 2148.8
## - height 1 18.02 5833.1 2151.7
## - I(triglycerides^2) 1 235.29 6050.3 2243.2
## - I(age^2) 1 329.26 6144.3 2281.8
## - age 1 353.01 6168.1 2291.4
## - triglycerides 1 392.60 6207.6 2307.4
##
## Step: AIC=2143.52
## sqrt(ldl) ~ gender + age + intake_fat + intake_chol + systolic +
## diastolic + weight + height + bmi + triglycerides + I(triglycerides^2) +
## I(age^2)
##
## Df Sum of Sq RSS AIC
## - intake_fat 1 0.01 5832.7 2141.5
## - gender 1 0.22 5832.9 2141.6
## - intake_chol 1 2.70 5835.4 2142.7
## <none> 5832.7 2143.5
## - systolic 1 8.37 5841.1 2145.1
## - weight 1 10.83 5843.5 2146.2
## - bmi 1 11.15 5843.9 2146.3
## - diastolic 1 11.83 5844.5 2146.6
## - height 1 15.11 5847.8 2148.0
## - I(triglycerides^2) 1 233.60 6066.3 2239.8
## - I(age^2) 1 329.37 6162.1 2279.0
## - age 1 351.52 6184.2 2288.0
## - triglycerides 1 396.27 6229.0 2306.0
##
## Step: AIC=2141.52
## sqrt(ldl) ~ gender + age + intake_chol + systolic + diastolic +
## weight + height + bmi + triglycerides + I(triglycerides^2) +
## I(age^2)
##
## Df Sum of Sq RSS AIC
## - gender 1 0.23 5832.9 2139.6
## - intake_chol 1 3.82 5836.5 2141.2
## <none> 5832.7 2141.5
## - systolic 1 8.36 5841.1 2143.1
## - weight 1 10.82 5843.5 2144.2
## - bmi 1 11.14 5843.9 2144.3
## - diastolic 1 11.91 5844.6 2144.6
## - height 1 15.12 5847.8 2146.0
## - I(triglycerides^2) 1 233.70 6066.4 2237.8
## - I(age^2) 1 329.38 6162.1 2277.0
## - age 1 351.51 6184.2 2286.0
## - triglycerides 1 396.26 6229.0 2304.0
##
## Step: AIC=2139.62
## sqrt(ldl) ~ age + intake_chol + systolic + diastolic + weight +
## height + bmi + triglycerides + I(triglycerides^2) + I(age^2)
##
## Df Sum of Sq RSS AIC
## - intake_chol 1 3.64 5836.6 2139.2
## <none> 5832.9 2139.6
## - systolic 1 8.72 5841.7 2141.4
## - weight 1 10.99 5843.9 2142.3
## - bmi 1 11.39 5844.3 2142.5
## - diastolic 1 11.85 5844.8 2142.7
## - height 1 14.93 5847.9 2144.0
## - I(triglycerides^2) 1 234.44 6067.4 2236.2
## - I(age^2) 1 331.94 6164.9 2276.2
## - age 1 354.72 6187.7 2285.4
## - triglycerides 1 399.15 6232.1 2303.3
##
## Step: AIC=2139.18
## sqrt(ldl) ~ age + systolic + diastolic + weight + height + bmi +
## triglycerides + I(triglycerides^2) + I(age^2)
##
## Df Sum of Sq RSS AIC
## <none> 5836.6 2139.2
## - systolic 1 9.00 5845.6 2141.0
## - weight 1 10.91 5847.5 2141.8
## - bmi 1 11.41 5848.0 2142.1
## - diastolic 1 12.16 5848.7 2142.4
## - height 1 15.47 5852.1 2143.8
## - I(triglycerides^2) 1 234.65 6071.2 2235.8
## - I(age^2) 1 328.68 6165.3 2274.3
## - age 1 351.28 6187.9 2283.5
## - triglycerides 1 399.35 6235.9 2302.8
summary(L5_a)
##
## Call:
## lm(formula = sqrt(ldl) ~ gender + age + race + intake_fat + intake_chol +
## systolic + diastolic + weight + height + bmi + triglycerides +
## I(triglycerides^2) + I(age^2), data = DT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2795 -0.9630 -0.0056 0.9958 5.4639
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.155e+01 2.065e+00 5.593 2.47e-08 ***
## gender2 -7.780e-02 8.572e-02 -0.908 0.36421
## age 1.075e-01 8.753e-03 12.282 < 2e-16 ***
## race2 8.510e-02 1.109e-01 0.767 0.44310
## race3 1.954e-01 9.673e-02 2.020 0.04349 *
## race4 2.324e-01 1.053e-01 2.207 0.02739 *
## race6 1.303e-01 1.258e-01 1.035 0.30054
## race7 3.578e-01 1.782e-01 2.008 0.04472 *
## intake_fat -7.455e-05 1.043e-03 -0.071 0.94303
## intake_chol -1.705e-04 2.019e-04 -0.844 0.39857
## systolic 3.840e-03 2.141e-03 1.794 0.07298 .
## diastolic 6.303e-03 2.858e-03 2.206 0.02750 *
## weight 2.520e-02 1.223e-02 2.060 0.03950 *
## height -3.457e-02 1.246e-02 -2.775 0.00556 **
## bmi -7.122e-02 3.413e-02 -2.087 0.03703 *
## triglycerides 2.150e-02 1.660e-03 12.953 < 2e-16 ***
## I(triglycerides^2) -5.019e-05 5.005e-06 -10.027 < 2e-16 ***
## I(age^2) -1.138e-03 9.593e-05 -11.862 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.53 on 2485 degrees of freedom
## Multiple R-squared: 0.2114, Adjusted R-squared: 0.206
## F-statistic: 39.2 on 17 and 2485 DF, p-value: < 2.2e-16
summary(L5_b)
##
## Call:
## lm(formula = sqrt(ldl) ~ age + systolic + diastolic + weight +
## height + bmi + triglycerides + I(triglycerides^2) + I(age^2),
## data = DT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2347 -0.9663 -0.0208 1.0166 5.4756
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.109e+01 2.011e+00 5.516 3.82e-08 ***
## age 1.047e-01 8.550e-03 12.249 < 2e-16 ***
## systolic 4.101e-03 2.091e-03 1.961 0.0500 *
## diastolic 6.455e-03 2.832e-03 2.279 0.0228 *
## weight 2.636e-02 1.221e-02 2.159 0.0310 *
## height -3.121e-02 1.214e-02 -2.571 0.0102 *
## bmi -7.503e-02 3.398e-02 -2.208 0.0274 *
## triglycerides 2.116e-02 1.620e-03 13.060 < 2e-16 ***
## I(triglycerides^2) -4.954e-05 4.948e-06 -10.011 < 2e-16 ***
## I(age^2) -1.105e-03 9.325e-05 -11.849 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.53 on 2493 degrees of freedom
## Multiple R-squared: 0.2085, Adjusted R-squared: 0.2057
## F-statistic: 72.98 on 9 and 2493 DF, p-value: < 2.2e-16
## By the way, we will have the linear mixed model upon the dataset:
LM=lmer(ldl~age+intake_fat+intake_chol+systolic+diastolic+
weight+height+bmi+triglycerides+(1|gender)+(1|race),data=DT)
summary(LM)
## Linear mixed model fit by REML ['lmerMod']
## Formula: ldl ~ age + intake_fat + intake_chol + systolic + diastolic +
## weight + height + bmi + triglycerides + (1 | gender) + (1 | race)
## Data: DT
##
## REML criterion at convergence: 24691.1
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.3525 -0.6773 -0.0739 0.5917 4.4911
##
## Random effects:
## Groups Name Variance Std.Dev.
## race (Intercept) 0.695 0.8337
## gender (Intercept) 0.000 0.0000
## Residual 1111.545 33.3398
## Number of obs: 2503, groups: race, 6; gender, 2
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 111.161885 43.863459 2.534
## age 0.175052 0.038036 4.602
## intake_fat 0.001994 0.022497 0.089
## intake_chol -0.001512 0.004350 -0.348
## systolic -0.009931 0.045057 -0.220
## diastolic 0.411477 0.056723 7.254
## weight 0.309732 0.265604 1.166
## height -0.351071 0.264700 -1.326
## bmi -0.630334 0.737863 -0.854
## triglycerides 0.141182 0.011105 12.714
##
## Correlation of Fixed Effects:
## (Intr) age intk_f intk_c systlc distlc weight height bmi
## age 0.073
## intake_fat 0.047 0.053
## intake_chol -0.011 -0.066 -0.596
## systolic -0.074 -0.458 0.027 0.015
## diastolic 0.011 0.102 -0.063 0.033 -0.302
## weight 0.961 0.071 0.017 -0.012 -0.009 0.022
## height -0.992 -0.054 -0.071 0.009 -0.003 -0.051 -0.963
## bmi -0.957 -0.081 -0.016 0.004 -0.003 -0.040 -0.991 0.955
## triglycerds -0.073 -0.134 0.053 -0.039 -0.047 -0.081 -0.093 0.077 0.067
## convergence code: 0
## boundary (singular) fit: see ?isSingular
## Additional Analysis: Some extra graphs
## Libraries: -------------------------------------------------------------------------
library(ggplot2)
library(gridExtra)
## 80: --------------------------------------------------------------------------------
## Graph 1: Some Diagnosis upon these models
### F1: Checking error assumptions--residual plots
R1=data.table(fitted_values=L2$fitted.values,residuals=L2$residuals)
R2=data.table(fitted_values=L3_b$fitted.values,residuals=L3_b$residuals)
R3=data.table(fitted_values=L4$fitted.values,residuals=L4$residuals)
R4=data.table(fitted_values=L5_b$fitted.values,residuals=L5_b$residuals)
rs1=ggplot(R1,aes(x=fitted_values,y=residuals))+geom_point(size=1,colour='blue')+
labs(title='Model 1')
rs2=ggplot(R2,aes(x=fitted_values,y=residuals))+geom_point(size=1,colour='blue')+
labs(title='Model 2')
rs3=ggplot(R3,aes(x=fitted_values,y=residuals))+geom_point(size=1,colour='blue')+
labs(title='Model 3')
rs4=ggplot(R4,aes(x=fitted_values,y=residuals))+geom_point(size=1,colour='blue')+
labs(title='Model 4')
grid.arrange(rs1,rs2,rs3,rs4,nrow=2)
## Graph 2: QQ-plots of the models
par(mfrow=c(2,2))
qqnorm(R1$residuals, ylab="Residuals",main='Q-Q Plot of Model 1')
qqline(R1$residuals)
qqnorm(R2$residuals, ylab="Residuals",main='Q-Q Plot of Model 2')
qqline(R2$residuals)
qqnorm(R3$residuals, ylab="Residuals",main='Q-Q Plot of Model 3')
qqline(R3$residuals)
qqnorm(R4$residuals, ylab="Residuals",main='Q-Q Plot of Model 4')
qqline(R4$residuals)
## Graph 3: Partial Residual Plots upon Model 3 and 4
crPlots(L4)
crPlots(L5_b)
## Graph 4: Relationships between ldl and gender/race. We have the mean level of
## ldl between different gender and race (For CI, we will use the JackKnife
## standard error.)
Mean_JK = function(x){
lx=length(x)
MX=matrix(rep(x,rep(lx-1,lx)),ncol=lx,byrow=TRUE)
theta=colMeans(MX)
mean_theta=mean(theta)
std_theta={(lx-1)/lx*sum((theta-mean_theta)^2)}^(1/2)
std_theta
}
Gend=DT[,.(gender,ldl)]
Race=DT[,.(race,ldl)]
MG=Gend[,.(mean_ldl=mean(ldl),l_ldl=mean(ldl)+qnorm(0.025)*Mean_JK(ldl),
r_ldl=mean(ldl)+qnorm(0.975)*Mean_JK(ldl)),by=gender]
MR=Race[,.(mean_ldl=mean(ldl),l_ldl=mean(ldl)+qnorm(0.025)*Mean_JK(ldl),
r_ldl=mean(ldl)+qnorm(0.975)*Mean_JK(ldl)),by=race]
gend=ggplot(Gend,aes(x=gender,y=ldl))+geom_point(size=1,colour='blue')+
labs(title='ldl~gender')
race=ggplot(Race,aes(x=race,y=ldl))+geom_point(size=1,colour='blue')+
labs(title='ldl~race')
mean_gend=ggplot(MG,aes(x=gender,y=mean_ldl))+geom_point(shape=16,col='red')+
geom_segment(data=MG,mapping=aes(x=gender,xend=gender,y=l_ldl,yend=r_ldl),
col='blue')+
labs(title = 'Mean level of ldl between genders')
mean_race=ggplot(MR,aes(x=race,y=mean_ldl))+geom_point(shape=16,col='red')+
geom_segment(data=MR,mapping=aes(x=race,xend=race,y=l_ldl,yend=r_ldl),
col='blue')+
labs(title = 'Mean level of ldl between races')
grid.arrange(gend,race,mean_gend,mean_race,nrow=2)
* import demographic data
import sasxport "./Original Data/DEMO_I.XPT.txt", clear
* rename variables
rename riagendr gender
rename ridageyr age
rename ridreth3 race
* select variables of focus
keep seqn gender age race
* save the cleaned demographics data
save "./Xiru Lyu/Data/demo.dta", replace
* import diet (day 1) data
import sasxport "./Original Data/DR1TOT_I.XPT.txt", clear
* rename variables
rename dr1ttfat fat1
rename dr1tchol chol1
* select variables of interest
keep seqn fat1 chol1
* save the cleaned diet (day 1) dataset
save "./Xiru Lyu/Data/diet1.dta", replace
* import diet (day 2) data
import sasxport "./Original Data/DR2TOT_I.XPT.txt", clear
* rename variables
rename dr2ttfat fat2
rename dr2tchol chol2
* select variables of focus
keep seqn fat2 chol2
* save the cleaned diet (day 2) dataset
save "./Xiru Lyu/Data/diet2.dta", replace
* import LDL & triglyceride data
import sasxport "./Original Data/TRIGLY_I.XPT.txt", clear
* rename variables
rename lbdldl ldl
rename lbxtr triglyceride
* select variables of interest
keep seqn ldl trig
* save the cleaned cholesterol dataset
save "./Xiru Lyu/Data/ldl.dta", replace
* import blood pressure data
import sasxport "./Original Data/BPX_I.XPT.txt", clear
* rename variables
rename bpxsy1 sy1
rename bpxsy2 sy2
rename bpxsy3 sy3
rename bpxsy4 sy4
rename bpxdi1 di1
rename bpxdi2 di2
rename bpxdi3 di3
rename bpxdi4 di4
* compute averaged systolic and diastolic blood pressure for each participant
egen systolic = rowmean(sy1 sy2 sy3 sy4)
egen diastolic = rowmean(di1 di2 di3 di4)
* select variables of interest
keep seqn systolic diastolic
* save the cleaned blood pressure dataset
save "./Xiru Lyu/Data/bp.dta", replace
* import body measure data
import sasxport "./Original Data/BMX_I.XPT.txt", clear
* rename variables
rename bmxwt weight
rename bmxbmi bmi
rename bmxht height
* select variables of interest
keep seqn weight height bmi
* merge datasets by seqn
merge 1:1 seqn using "./Xiru Lyu/Data/demo.dta"
keep if _merge == 3
drop _merge
merge 1:1 seqn using "./Xiru Lyu/Data/diet1.dta"
keep if _merge == 3
drop _merge
merge 1:1 seqn using "./Xiru Lyu/Data/diet2.dta"
keep if _merge == 3
drop _merge
merge 1:1 seqn using "./Xiru Lyu/Data/bp.dta"
keep if _merge == 3
drop _merge
merge 1:1 seqn using "./Xiru Lyu/Data/ldl.dta"
keep if _merge == 3
drop _merge
* compute averaged intakes of fat and cholesterol
egen fat = rowmean(fat1 fat2)
egen chol = rowmean(chol1 chol2)
* drop extra columns
drop fat1 fat2 chol1 chol2
* drop rows with missing values
foreach var of varlist age bmi chol diastolic fat gender height ldl race ///
seqn systolic triglyceride weight{
drop if missing(`var')
}
* save the dataset for data analysis
save "./Xiru Lyu/Data/final.dta", replace
* transform the dependent variable
generate ldl2 = sqrt(ldl)
* fit a multiple linear regression model
regress ldl2 age i.race i.gender bmi weight height diastolic systolic chol ///
fat trig
Regression Result of Full Model (2)
* backward selection
xi: stepwise, pr(.1): regress ldl2 age i.race i.gender bmi weight height ///
diastolic systolic chol fat triglyceride
Backward Selection Result of Full Model (2)
* forward selection
xi: stepwise, pe(.1): regress ldl2 age i.race i.gender bmi weight height ///
diastolic systolioc chol fat triglyceride
Forward Selection Result of Full Model (2)
To be consistent with the model selection result performed in R (reasons to be discussed)
* transform covariates
generate triglyceride2 = triglyceride^2
generate age2 = age^2
* fit a multiple linear regression model
regress ldl2 age age2 i.race i.gender bmi weight height diastolic systolic ///
chol fat triglyceride triglyceride2
Regression Result of Full Model (3)
* backward selection
xi: stepwise, pr(.05): regress ldl2 age age2 i.race i.gender bmi weight ///
height diastolic systolic chol fat triglyceride triglyceride2
Backward Selection Result of Full Model (3)
* forward selection
xi: stepwise, pe(.05): regress ldl2 age age2 i.race i.gender bmi weight ///
height diastolic systolic chol fat triglyceride triglyceride2
Forward Selection Result of Full Model (3)
* compare AIC & BIC of two nested models
regress ldl2 age height weight diastolic triglyceride
estat ic
AIC for Model 2_a
regress ldl2 age age2 height weight bmi diastolic systolic triglyceride ///
triglyceride2
estat ic
AIC for Model 2_b
qreg ldl2 triglyceride, quantile(0.05)
qreg ldl2 triglyceride, quantile(0.25)
qreg ldl2 triglyceride, quantile(0.5)
qreg ldl2 triglyceride, quantile(0.75)
qreg ldl2 triglyceride, quantile(0.90)
(attach the plot here?)